feat(consensus): persist standalone WAL with crash-safe replay#3763
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3763 +/- ##
==========================================
+ Coverage 75.30% 75.37% +0.06%
==========================================
Files 433 438 +5
Lines 38664 39463 +799
==========================================
+ Hits 29116 29744 +628
- Misses 7547 7647 +100
- Partials 2001 2072 +71 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
e8c340a to
0383582
Compare
|
Claude finished @ongyimeng's task in 6m 35s —— View job PR Review: Standalone consensus WAL
Reviewed the full change set with a focus on consensus-recovery correctness, durability ordering, concurrency, and resource handling. This is a well-engineered PR — the durability invariants are sound and the failure-injection test coverage is unusually thorough. I have no blockers. BlockerNone. ImportantNone. A few things I verified that are correct and worth recording:
Nit
Test quality is good: channel-based synchronization (no |
|
Claude finished @ongyimeng's task in 2m 1s —— View job Re-review: standalone consensus WAL (synchronize)Re-reviewed the three commits pushed since the last review. They directly address the prior nits, and the changed packages build cleanly.
VerdictAll three changes are correct. No blockers, no new issues.
New decode-path tests ( Nit (non-blocking, optional)
Overall: the PR remains in good shape — durability invariants intact, the prior review's concrete items are now closed. |
|
Claude finished @ongyimeng's task in 2m 27s —— View job PR Re-review: standalone consensus WAL (synchronize)Re-reviewed the commits pushed since the last review:
These directly address @rodrodros's review nits. Affected packages ( VerdictAll changes are correct. No blockers, no new issues.
The error-message wording flagged earlier ( Overall: the PR remains in good shape. The durability invariants verified in the earlier reviews are untouched, and the refactors are behavior-preserving. |
|
Claude finished @ongyimeng's task in 2m 44s —— View job PR Re-review: standalone consensus WAL (synchronize)Reviewed the commits pushed since the last review:
These address @rodrodros's remaining review nits. VerdictAll changes are correct and behavior-preserving. No blockers, no new issues. Package rename Removed nil guards in Removed
Error wording ( Overall: the PR remains in good shape. The durability/replay invariants verified in earlier reviews are untouched, and this batch is a clean set of nit-resolving refactors. |
|
Please hold on for abit, I’m still doing some final checks and cleaning up a few things. |
|
Claude finished @ongyimeng's task in 2m 41s —— View job PR Re-review: standalone consensus WAL (synchronize)Reviewed the commits pushed since the last review:
These close out the last of @rodrodros's review nits. VerdictAll changes are correct and behavior-preserving. No blockers, no new issues.
The helper only reads
So the two callers —
Overall: the PR remains in good shape. The durability/replay invariants verified in the earlier reviews are untouched, and this batch is a clean set of nit-resolving refactors. |
|
Claude finished @ongyimeng's task in 1m 51s —— View job PR Re-review: standalone consensus WAL (synchronize)Reviewed the two commits pushed since the last review:
VerdictBoth changes are correct and behavior-preserving. No blockers, no new issues.
Overall: the PR remains in good shape. The durability/replay invariants verified in earlier reviews are untouched, and this batch is a clean readability + test-hardening pass. • Branch: |
|
Claude finished @ongyimeng's task in 2m 32s —— View job PR Re-review: standalone consensus WAL (synchronize)Reviewed the change pushed since the last review:
The only new commit since the last review is the VerdictNo blockers, no new issues. The WAL durability/replay invariants verified across the earlier reviews are unchanged. Merge adaptations — within the PR's scope, the merge only removed a now-deleted
These are purely mechanical field removals tracking the upstream deletion of Build — Overall: the merge is clean and the PR remains in good shape. Nothing further to flag. |
Summary
Replaces the consensus WAL, previously stored as generic CBOR-encoded values in the shared Pebble KV store, with a dedicated append-only WAL owned by consensus. The new WAL uses typed records, segment files, pruning, and crash-safe replay.
Why
Consensus relies on WAL replay to recover after a crash mid-height. As generic KV values, that recovery path was harder to inspect and debug, and was coupled to the shared KV store. A standalone WAL gives consensus an explicit durability boundary with clearer replay and failure handling.
Issue #3114
What changed
Startentry replay. It already existed as an entry type, but the old generic CBOR decoding couldn't reconstruct it on replay.Performance
Headline comparison is
BenchmarkTendermintWALStoreSustainedHeightLatency, which is closest to the consensus write/delete flow: 1000 heights, real cleanup frequency,count=5.Medians:
Same latency at p50/max, better p95/p99, fewer allocations.
B/opandallocs/opare per 1000-height run, since Go counts the loop as one op under-benchtime=1x.Note: This benchmark uses 1,000 heights, which fits within Pebble’s in-memory write path. It does not exercise the old KV-backed WAL’s eventual flush, compaction, or tombstone-reclamation costs. There is no steady-state latency regression at this scale, but it is not representative of recurring background costs in a long-running node which is opaque and state dependent.
Testing
The remaining coverage gaps are mostly rare internal error paths. The main WAL recovery flows are already covered, so extra tests would add little value.